Analyzing NYC High Schools SAT Results


Introduction

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests, and whether they're unfair to certain groups. Given my prior knowledge of this topic, investigating the correlations between SAT scores and demographics seemed like an interesting angle to take. I could correlate SAT scores with factors like race, gender, income, and more.

The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's fairly important to perform well on it.

The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the data set for our project is based on 2,400 total points). Organizations often rank high schools by their average SAT scores. The scores are also considered a measure of overall school district quality.

New York City makes its data on high school SAT scores available online, as well as the demographics for each high school.

Further exploration revealed that these two data sets alone didn't give me all the demographic information I wanted for my full analysis.

The same website had several related data sets covering demographic information and test scores. Here are the links to all of the data sets I used:

  • SAT scores by school - SAT scores for each high school in New York City
  • School attendance - Attendance information for each school in New York City
  • Class size - Information on class size for each school
  • AP test results - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
  • Graduation outcomes - The percentage of students who graduated, and other outcome information
  • Demographics - Demographic information for each school
  • School survey - Surveys of parents, teachers, and students at each school

Background Research

In my case, I researched the topic's background and drew up a list of observations relevant to my analysis:

  • Only high school students take the SAT, so we'll want to focus on high schools.
  • New York City is made up of five boroughs, which are essentially distinct regions.
  • New York City schools fall within several different school districts, each of which can contain dozens of schools.
  • Our data sets include several different types of schools. We'll need to clean them so that we can focus on high schools only.
  • Each school in New York City has a unique code called a DBN, or district borough number.
  • Aggregating data by district will allow us to use the district mapping data to plot district-by-district differences.
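The DBN structure can be sketched with a small helper (a hypothetical `split_dbn` function, not part of the analysis code), which splits a code like 01M292 into its district, borough, and school parts:

```python
# Hypothetical helper illustrating the DBN layout:
# two-digit district, one-letter borough code, then the school number.
def split_dbn(dbn):
    return {"district": dbn[:2], "borough": dbn[2], "school": dbn[3:]}

print(split_dbn("01M292"))  # {'district': '01', 'borough': 'M', 'school': '292'}
```

The two-digit district prefix is the piece the district-level aggregation will key on later.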

Reading in the data

The data files all live in the schools folder; I'll read each one into a dataframe, stored in a dictionary keyed by file name:

In [1]:
import pandas as pd

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    key = f.replace(".csv", "")
    data[key] = d

Exploring the SAT Data

In [2]:
data['sat_results'].head()
Out[2]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355 404 363
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383 423 366
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377 402 370
3 01M458 FORSYTH SATELLITE ACADEMY 7 414 401 359
4 01M509 MARTA VALLE HIGH SCHOOL 44 390 433 384

Observations

  • The DBN appears to be a unique ID for each school.
  • The first few rows of names indicate the data is about high schools.
  • There's only a single row for each high school, so each DBN is unique in the SAT data.
  • Combining the three columns that contain SAT scores -- SAT Critical Reading Avg. Score, SAT Math Avg. Score, and SAT Writing Avg. Score -- into a single column would make the scores easier to analyze.

ap_2010

In [3]:
data['ap_2010'].head()
Out[3]:
DBN SchoolName AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5
0 01M448 UNIVERSITY NEIGHBORHOOD H.S. 39 49 10
1 01M450 EAST SIDE COMMUNITY HS 19 21 s
2 01M515 LOWER EASTSIDE PREP 24 26 24
3 01M539 NEW EXPLORATIONS SCI,TECH,MATH 255 377 191
4 02M296 High School of Hospitality Management s s s
In [4]:
data['class_size'].head()
Out[4]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO
0 1 M M015 P.S. 015 Roberto Clemente 0K GEN ED - - - 19.0 1.0 19.0 19.0 19.0 ATS NaN
1 1 M M015 P.S. 015 Roberto Clemente 0K CTT - - - 21.0 1.0 21.0 21.0 21.0 ATS NaN
2 1 M M015 P.S. 015 Roberto Clemente 01 GEN ED - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN
3 1 M M015 P.S. 015 Roberto Clemente 01 CTT - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN
4 1 M M015 P.S. 015 Roberto Clemente 02 GEN ED - - - 15.0 1.0 15.0 15.0 15.0 ATS NaN

demographics

In [5]:
data['demographics'].head()
Out[5]:
DBN Name schoolyear fl_percent frl_percent total_enrollment prek k grade1 grade2 ... black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
0 01M015 P.S. 015 ROBERTO CLEMENTE 20052006 89.4 NaN 281 15 36 40 33 ... 74 26.3 189 67.3 5 1.8 158.0 56.2 123.0 43.8
1 01M015 P.S. 015 ROBERTO CLEMENTE 20062007 89.4 NaN 243 15 29 39 38 ... 68 28.0 153 63.0 4 1.6 140.0 57.6 103.0 42.4
2 01M015 P.S. 015 ROBERTO CLEMENTE 20072008 89.4 NaN 261 18 43 39 36 ... 77 29.5 157 60.2 7 2.7 143.0 54.8 118.0 45.2
3 01M015 P.S. 015 ROBERTO CLEMENTE 20082009 89.4 NaN 252 17 37 44 32 ... 75 29.8 149 59.1 7 2.8 149.0 59.1 103.0 40.9
4 01M015 P.S. 015 ROBERTO CLEMENTE 20092010 96.5 208 16 40 28 32 ... 67 32.2 118 56.7 6 2.9 124.0 59.6 84.0 40.4

5 rows × 38 columns

In [6]:
data['hs_directory'].head()
Out[6]:
dbn school_name borough building_code phone_number fax_number grade_span_min grade_span_max expgrade_span_min expgrade_span_max ... priority08 priority09 priority10 Location 1 Community Board Council District Census Tract BIN BBL NTA
0 17K548 Brooklyn School for Music & Theatre Brooklyn K440 718-230-6250 718-230-6262 9.0 12 NaN NaN ... NaN NaN NaN 883 Classon Avenue\nBrooklyn, NY 11225\n(40.67... 9.0 35.0 213.0 3029686.0 3.011870e+09 Crown Heights South ...
1 09X543 High School for Violin and Dance Bronx X400 718-842-0687 718-589-9849 9.0 12 NaN NaN ... NaN NaN NaN 1110 Boston Road\nBronx, NY 10456\n(40.8276026... 3.0 16.0 135.0 2004526.0 2.026340e+09 Morrisania-Melrose ...
2 09X327 Comprehensive Model School Project M.S. 327 Bronx X240 718-294-8111 718-294-8109 6.0 12 NaN NaN ... NaN NaN NaN 1501 Jerome Avenue\nBronx, NY 10452\n(40.84241... 4.0 14.0 209.0 2008336.0 2.028590e+09 West Concourse ...
3 02M280 Manhattan Early College School for Advertising Manhattan M520 718-935-3477 NaN 9.0 10 9.0 14.0 ... NaN NaN NaN 411 Pearl Street\nNew York, NY 10038\n(40.7106... 1.0 1.0 29.0 1001388.0 1.001130e+09 Chinatown ...
4 28Q680 Queens Gateway to Health Sciences Secondary Sc... Queens Q695 718-969-3155 718-969-3552 6.0 12 NaN NaN ... NaN NaN NaN 160 20 Goethals Avenue\nJamaica, NY 11432\n(40... 8.0 24.0 1267.0 4539721.0 4.068580e+09 Pomonok-Flushing Heights-Hillcrest ...

5 rows × 64 columns

In [7]:
data['sat_results'].head()
Out[7]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355 404 363
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383 423 366
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377 402 370
3 01M458 FORSYTH SATELLITE ACADEMY 7 414 401 359
4 01M509 MARTA VALLE HIGH SCHOOL 44 390 433 384

Observations

  • Each data set appears to either have a DBN column, or the information needed to create one. That means I can use a DBN column to combine the data sets. First I'll pinpoint matching rows from different data sets by looking for identical DBNs, then group all of their columns together in a single data set.
  • Some fields look interesting for mapping -- particularly Location 1, which contains coordinates inside a larger string.
  • Some of the data sets appear to contain multiple rows for each school (because the rows have duplicate DBN values). That means I’ll have to do some preprocessing to ensure that each DBN is unique within each data set. Failure to do this would make combining the data sets problematic because I might be merging two rows in one data set with one row in another data set.
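A quick way to check that last point is pandas' `duplicated` method, sketched here on a toy dataframe rather than the real data sets (the real check would loop over the dataframes in `data`):

```python
import pandas as pd

# Toy stand-in for one of the data sets
df = pd.DataFrame({"DBN": ["01M292", "01M292", "01M448"], "value": [1, 2, 3]})

# True if any DBN appears more than once, i.e. preprocessing is needed
has_dupes = df["DBN"].duplicated().any()
print(has_dupes)  # True
```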

Reading in the survey data

A part of each of the survey text files looks like this:

dbn bn schoolname d75 studentssurveyed highschool schooltype rr_s
"01M015" "M015" "P.S. 015 Roberto Clemente" 0 "No" 0 "Elementary School" 88

In [8]:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding="windows-1252")
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding="windows-1252")

survey = pd.concat([all_survey, d75_survey], axis = 0)

survey.head()
Out[8]:
N_p N_s N_t aca_p_11 aca_s_11 aca_t_11 aca_tot_11 bn com_p_11 com_s_11 ... t_q8c_1 t_q8c_2 t_q8c_3 t_q8c_4 t_q9 t_q9_1 t_q9_2 t_q9_3 t_q9_4 t_q9_5
0 90.0 NaN 22.0 7.8 NaN 7.9 7.9 M015 7.6 NaN ... 29.0 67.0 5.0 0.0 NaN 5.0 14.0 52.0 24.0 5.0
1 161.0 NaN 34.0 7.8 NaN 9.1 8.4 M019 7.6 NaN ... 74.0 21.0 6.0 0.0 NaN 3.0 6.0 3.0 78.0 9.0
2 367.0 NaN 42.0 8.6 NaN 7.5 8.0 M020 8.3 NaN ... 33.0 35.0 20.0 13.0 NaN 3.0 5.0 16.0 70.0 5.0
3 151.0 145.0 29.0 8.5 7.4 7.8 7.9 M034 8.2 5.9 ... 21.0 45.0 28.0 7.0 NaN 0.0 18.0 32.0 39.0 11.0
4 90.0 NaN 23.0 7.9 NaN 8.1 8.0 M063 7.9 NaN ... 59.0 36.0 5.0 0.0 NaN 10.0 5.0 10.0 60.0 15.0

5 rows × 2773 columns

Observations

  • There are over 2000 columns, nearly all of which I don't need. I'll have to filter the data to remove the unnecessary ones using the survey data dictionary.
  • The survey data has a dbn column that I'll convert to uppercase (DBN) to ensure column name consistency with the other data sets.

Cleaning up the Surveys

Based on the dictionary, the relevant columns are:

  • dbn
  • rr_s
  • rr_t
  • rr_p
  • N_s
  • N_t
  • N_p
  • saf_p_11
  • com_p_11
  • eng_p_11
  • aca_p_11
  • saf_t_11
  • com_t_11
  • eng_t_11
  • aca_t_11
  • saf_s_11
  • com_s_11
  • eng_s_11
  • aca_s_11
  • saf_tot_11
  • com_tot_11
  • eng_tot_11
  • aca_tot_11
In [9]:
survey['DBN'] = survey['dbn'] 
survey = survey.loc[:,["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]]
data['survey'] = survey

data['survey'].shape
Out[9]:
(1702, 23)

Inserting DBN Fields

Exploring the remaining data showed that some dataframes, such as class_size, don't have a DBN column at all, while hs_directory has a lowercase dbn column that can simply be copied to DBN.

Looking at the first few rows of sat_results, we can tell that its DBN is just a combination of the CSD and SCHOOL CODE columns in the class_size data. The main difference is that the DBN is zero-padded, so that the CSD portion of it always consists of two digits.

In [10]:
data['hs_directory']['DBN'] = data['hs_directory']['dbn']

def padding_csd(num):
    # Zero-pad single-digit district numbers so the CSD is always two characters
    return str(num).zfill(2)
        
data['class_size']['padded_csd'] = data['class_size']['CSD'].apply(padding_csd)
data['class_size']['DBN'] = data['class_size']['padded_csd'] + data['class_size']['SCHOOL CODE']

data['class_size']['DBN'].head()
Out[10]:
0    01M015
1    01M015
2    01M015
3    01M015
4    01M015
Name: DBN, dtype: object

Combining the SAT Scores

In [11]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']

for col in cols:
    data['sat_results'][col] = pd.to_numeric(data['sat_results'][col], errors = 'coerce')
    
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

data['sat_results']['sat_score'].head()
Out[11]:
0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64
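One subtlety in the addition above: `+` propagates NaN, so a school missing any section score gets a NaN total rather than a misleading partial sum. An equivalent vectorized form, sketched on toy data, needs `skipna=False` to preserve that behavior:

```python
import pandas as pd

scores = pd.DataFrame({
    "reading": [355.0, 500.0],
    "math": [404.0, None],   # second school is missing its math score
    "writing": [363.0, 510.0],
})

# skipna=False matches column-by-column addition: any missing section -> NaN total
totals = scores.sum(axis=1, skipna=False)
print(totals.tolist())  # [1122.0, nan]
```

With the default `skipna=True`, the missing section would silently count as zero instead.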

Parsing Geographic Coordinates for Schools

In [12]:
data['hs_directory']['Location 1'].head()
Out[12]:
0    883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1    1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2    1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3    411 Pearl Street\nNew York, NY 10038\n(40.7106...
4    160 20 Goethals Avenue\nJamaica, NY 11432\n(40...
Name: Location 1, dtype: object

Computing Latitudes

In [13]:
import re

def find_lat(loc):
    # Raw string avoids invalid-escape warnings in the regex
    coords = re.findall(r"\(.+\)", loc)
    lat = coords[0].split(',')[0].replace("(", "")
    return lat

data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(find_lat)

data['hs_directory'].head()
Out[13]:
dbn school_name borough building_code phone_number fax_number grade_span_min grade_span_max expgrade_span_min expgrade_span_max ... priority10 Location 1 Community Board Council District Census Tract BIN BBL NTA DBN lat
0 17K548 Brooklyn School for Music & Theatre Brooklyn K440 718-230-6250 718-230-6262 9.0 12 NaN NaN ... NaN 883 Classon Avenue\nBrooklyn, NY 11225\n(40.67... 9.0 35.0 213.0 3029686.0 3.011870e+09 Crown Heights South ... 17K548 40.67029890700047
1 09X543 High School for Violin and Dance Bronx X400 718-842-0687 718-589-9849 9.0 12 NaN NaN ... NaN 1110 Boston Road\nBronx, NY 10456\n(40.8276026... 3.0 16.0 135.0 2004526.0 2.026340e+09 Morrisania-Melrose ... 09X543 40.8276026690005
2 09X327 Comprehensive Model School Project M.S. 327 Bronx X240 718-294-8111 718-294-8109 6.0 12 NaN NaN ... NaN 1501 Jerome Avenue\nBronx, NY 10452\n(40.84241... 4.0 14.0 209.0 2008336.0 2.028590e+09 West Concourse ... 09X327 40.842414068000494
3 02M280 Manhattan Early College School for Advertising Manhattan M520 718-935-3477 NaN 9.0 10 9.0 14.0 ... NaN 411 Pearl Street\nNew York, NY 10038\n(40.7106... 1.0 1.0 29.0 1001388.0 1.001130e+09 Chinatown ... 02M280 40.71067947100045
4 28Q680 Queens Gateway to Health Sciences Secondary Sc... Queens Q695 718-969-3155 718-969-3552 6.0 12 NaN NaN ... NaN 160 20 Goethals Avenue\nJamaica, NY 11432\n(40... 8.0 24.0 1267.0 4539721.0 4.068580e+09 Pomonok-Flushing Heights-Hillcrest ... 28Q680 40.718810094000446

5 rows × 66 columns

Computing Longitudes

In [14]:
def find_long(loc):
    coords = re.findall(r"\(.+\)", loc)
    long = coords[0].split(',')[1].replace(")", "")
    return long

data['hs_directory']['long'] = data['hs_directory']['Location 1'].apply(find_long)

data['hs_directory']['lat'] = pd.to_numeric(data['hs_directory']['lat'], errors = 'coerce')
data['hs_directory']['long'] = pd.to_numeric(data['hs_directory']['long'], errors = 'coerce')

data['hs_directory'].head()
Out[14]:
dbn school_name borough building_code phone_number fax_number grade_span_min grade_span_max expgrade_span_min expgrade_span_max ... Location 1 Community Board Council District Census Tract BIN BBL NTA DBN lat long
0 17K548 Brooklyn School for Music & Theatre Brooklyn K440 718-230-6250 718-230-6262 9.0 12 NaN NaN ... 883 Classon Avenue\nBrooklyn, NY 11225\n(40.67... 9.0 35.0 213.0 3029686.0 3.011870e+09 Crown Heights South ... 17K548 40.670299 -73.961648
1 09X543 High School for Violin and Dance Bronx X400 718-842-0687 718-589-9849 9.0 12 NaN NaN ... 1110 Boston Road\nBronx, NY 10456\n(40.8276026... 3.0 16.0 135.0 2004526.0 2.026340e+09 Morrisania-Melrose ... 09X543 40.827603 -73.904475
2 09X327 Comprehensive Model School Project M.S. 327 Bronx X240 718-294-8111 718-294-8109 6.0 12 NaN NaN ... 1501 Jerome Avenue\nBronx, NY 10452\n(40.84241... 4.0 14.0 209.0 2008336.0 2.028590e+09 West Concourse ... 09X327 40.842414 -73.916162
3 02M280 Manhattan Early College School for Advertising Manhattan M520 718-935-3477 NaN 9.0 10 9.0 14.0 ... 411 Pearl Street\nNew York, NY 10038\n(40.7106... 1.0 1.0 29.0 1001388.0 1.001130e+09 Chinatown ... 02M280 40.710679 -74.000807
4 28Q680 Queens Gateway to Health Sciences Secondary Sc... Queens Q695 718-969-3155 718-969-3552 6.0 12 NaN NaN ... 160 20 Goethals Avenue\nJamaica, NY 11432\n(40... 8.0 24.0 1267.0 4539721.0 4.068580e+09 Pomonok-Flushing Heights-Hillcrest ... 28Q680 40.718810 -73.806500

5 rows × 67 columns

Condensing the Class Size Data Set

In [15]:
data['class_size'].head()
Out[15]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO padded_csd DBN
0 1 M M015 P.S. 015 Roberto Clemente 0K GEN ED - - - 19.0 1.0 19.0 19.0 19.0 ATS NaN 01 01M015
1 1 M M015 P.S. 015 Roberto Clemente 0K CTT - - - 21.0 1.0 21.0 21.0 21.0 ATS NaN 01 01M015
2 1 M M015 P.S. 015 Roberto Clemente 01 GEN ED - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN 01 01M015
3 1 M M015 P.S. 015 Roberto Clemente 01 CTT - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN 01 01M015
4 1 M M015 P.S. 015 Roberto Clemente 02 GEN ED - - - 15.0 1.0 15.0 15.0 15.0 ATS NaN 01 01M015
In [16]:
# Fix key error for 'GRADE '
data['class_size']['GRADE'] = data['class_size']['GRADE ']

# Exploring unique values for 'GRADE'
data['class_size']['GRADE'].unique()
Out[16]:
array(['0K', '01', '02', '03', '04', '05', '0K-09', nan, '06', '07', '08',
       'MS Core', '09-12', '09'], dtype=object)

Because we're dealing with high schools, we're only concerned with grades 9 through 12.

In [17]:
# Exploring unique values for 'PROGRAM TYPE'
data['class_size']['PROGRAM TYPE'].unique()
Out[17]:
array(['GEN ED', 'CTT', 'SPEC ED', nan, 'G&T'], dtype=object)

Each school can have multiple program types. Because GEN ED is by far the largest category, I'll only select rows where PROGRAM TYPE is GEN ED.

In [18]:
class_size = data['class_size']

class_size = class_size[class_size['GRADE'] == '09-12' ]
class_size = class_size[class_size['PROGRAM TYPE'] == 'GEN ED' ]

class_size.head()
Out[18]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO padded_csd DBN GRADE
225 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 9 - 63.0 3.0 21.0 19.0 25.0 STARS NaN 01 01M292 09-12
226 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 10 - 79.0 3.0 26.3 24.0 31.0 STARS NaN 01 01M292 09-12
227 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 11 - 38.0 2.0 19.0 16.0 22.0 STARS NaN 01 01M292 09-12
228 1 M M292 Henry Street School for International Studies 09-12 GEN ED ENGLISH English 12 - 69.0 3.0 23.0 13.0 30.0 STARS NaN 01 01M292 09-12
229 1 M M292 Henry Street School for International Studies 09-12 GEN ED MATH Integrated Algebra - 53.0 3.0 17.7 16.0 21.0 STARS NaN 01 01M292 09-12

DBN still isn't completely unique. This is due to the CORE COURSE (MS CORE and 9-12 ONLY) and CORE SUBJECT (MS CORE and 9-12 ONLY) columns.

In [19]:
# Exploring unique values for 'CORE SUBJECT (MS CORE and 9-12 ONLY)'
data['class_size']['CORE SUBJECT (MS CORE and 9-12 ONLY)'].unique()
Out[19]:
array(['-', nan, 'ENGLISH', 'MATH', 'SCIENCE', 'SOCIAL STUDIES'], dtype=object)

Computing Average Class Sizes

In [20]:
# Average the numeric columns for each school so each DBN maps to a single row
class_size = class_size.groupby('DBN').mean(numeric_only=True)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["class_size"].head()
Out[20]:
DBN CSD NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS SCHOOLWIDE PUPIL-TEACHER RATIO
0 01M292 1 88.0000 4.000000 22.564286 18.50 26.571429 NaN
1 01M332 1 46.0000 2.000000 22.000000 21.00 23.500000 NaN
2 01M378 1 33.0000 1.000000 33.000000 33.00 33.000000 NaN
3 01M448 1 105.6875 4.750000 22.231250 18.25 27.062500 NaN
4 01M450 1 57.6000 2.733333 21.200000 19.40 22.866667 NaN

Condensing the Demographics Data Set

In [21]:
data["demographics"].head(7)
Out[21]:
DBN Name schoolyear fl_percent frl_percent total_enrollment prek k grade1 grade2 ... black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
0 01M015 P.S. 015 ROBERTO CLEMENTE 20052006 89.4 NaN 281 15 36 40 33 ... 74 26.3 189 67.3 5 1.8 158.0 56.2 123.0 43.8
1 01M015 P.S. 015 ROBERTO CLEMENTE 20062007 89.4 NaN 243 15 29 39 38 ... 68 28.0 153 63.0 4 1.6 140.0 57.6 103.0 42.4
2 01M015 P.S. 015 ROBERTO CLEMENTE 20072008 89.4 NaN 261 18 43 39 36 ... 77 29.5 157 60.2 7 2.7 143.0 54.8 118.0 45.2
3 01M015 P.S. 015 ROBERTO CLEMENTE 20082009 89.4 NaN 252 17 37 44 32 ... 75 29.8 149 59.1 7 2.8 149.0 59.1 103.0 40.9
4 01M015 P.S. 015 ROBERTO CLEMENTE 20092010 96.5 208 16 40 28 32 ... 67 32.2 118 56.7 6 2.9 124.0 59.6 84.0 40.4
5 01M015 P.S. 015 ROBERTO CLEMENTE 20102011 96.5 203 13 37 35 33 ... 75 36.9 110 54.2 4 2.0 113.0 55.7 90.0 44.3
6 01M015 P.S. 015 ROBERTO CLEMENTE 20112012 NaN 89.4 189 13 31 35 28 ... 63 33.3 109 57.7 4 2.1 97.0 51.3 92.0 48.7

7 rows × 38 columns

In [22]:
data["demographics"]['schoolyear'].dtype
Out[22]:
dtype('int64')
In [23]:
data['demographics'] = data["demographics"][data["demographics"]['schoolyear'] == 20112012]
data['demographics'].head()
Out[23]:
DBN Name schoolyear fl_percent frl_percent total_enrollment prek k grade1 grade2 ... black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
6 01M015 P.S. 015 ROBERTO CLEMENTE 20112012 NaN 89.4 189 13 31 35 28 ... 63 33.3 109 57.7 4 2.1 97.0 51.3 92.0 48.7
13 01M019 P.S. 019 ASHER LEVY 20112012 NaN 61.5 328 32 46 52 54 ... 81 24.7 158 48.2 28 8.5 147.0 44.8 181.0 55.2
20 01M020 PS 020 ANNA SILVER 20112012 NaN 92.5 626 52 102 121 87 ... 55 8.8 357 57.0 16 2.6 330.0 52.7 296.0 47.3
27 01M034 PS 034 FRANKLIN D ROOSEVELT 20112012 NaN 99.7 401 14 34 38 36 ... 90 22.4 275 68.6 8 2.0 204.0 50.9 197.0 49.1
35 01M063 PS 063 WILLIAM MCKINLEY 20112012 NaN 78.9 176 18 20 30 21 ... 41 23.3 110 62.5 15 8.5 97.0 55.1 79.0 44.9

5 rows × 38 columns

Condensing the Graduation Data Set

In [24]:
data['graduation'].head()
Out[24]:
Demographic DBN School Name Cohort Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n Total Regents - % of cohort Total Regents - % of grads ... Regents w/o Advanced - n Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads Local - n Local - % of cohort Local - % of grads Still Enrolled - n Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort
0 Total Cohort 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL 2003 5 s s s s s ... s s s s s s s s s s
1 Total Cohort 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL 2004 55 37 67.3% 17 30.9% 45.9% ... 17 30.9% 45.9% 20 36.4% 54.1% 15 27.3% 3 5.5%
2 Total Cohort 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL 2005 64 43 67.2% 27 42.2% 62.8% ... 27 42.2% 62.8% 16 25% 37.200000000000003% 9 14.1% 9 14.1%
3 Total Cohort 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL 2006 78 43 55.1% 36 46.2% 83.7% ... 36 46.2% 83.7% 7 9% 16.3% 16 20.5% 11 14.1%
4 Total Cohort 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL 2006 Aug 78 44 56.4% 37 47.4% 84.1% ... 37 47.4% 84.1% 7 9% 15.9% 15 19.2% 11 14.1%

5 rows × 23 columns

In [25]:
data['graduation']['Demographic'].unique()
Out[25]:
array(['Total Cohort', 'Male', 'English Language Learners',
       'English Proficient Students', 'Special Education Students',
       'General Education Students', 'Asian', 'Black', 'Hispanic', 'White',
       'Female'], dtype=object)
In [26]:
data['graduation']['Cohort'].unique()
Out[26]:
array(['2003', '2004', '2005', '2006', '2006 Aug', '2001', '2002'], dtype=object)
In [27]:
data['graduation'] = data['graduation'][data['graduation']['Cohort'] == '2006']
data['graduation'] = data['graduation'][data['graduation']['Demographic'] == 'Total Cohort']
data['graduation'].head()
Out[27]:
Demographic DBN School Name Cohort Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n Total Regents - % of cohort Total Regents - % of grads ... Regents w/o Advanced - n Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads Local - n Local - % of cohort Local - % of grads Still Enrolled - n Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort
3 Total Cohort 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL 2006 78 43 55.1% 36 46.2% 83.7% ... 36 46.2% 83.7% 7 9% 16.3% 16 20.5% 11 14.1%
10 Total Cohort 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 2006 124 53 42.7% 42 33.9% 79.2% ... 34 27.4% 64.2% 11 8.9% 20.8% 46 37.1% 20 16.100000000000001%
17 Total Cohort 01M450 EAST SIDE COMMUNITY SCHOOL 2006 90 70 77.8% 67 74.400000000000006% 95.7% ... 67 74.400000000000006% 95.7% 3 3.3% 4.3% 15 16.7% 5 5.6%
24 Total Cohort 01M509 MARTA VALLE HIGH SCHOOL 2006 84 47 56% 40 47.6% 85.1% ... 23 27.4% 48.9% 7 8.300000000000001% 14.9% 25 29.8% 5 6%
31 Total Cohort 01M515 LOWER EAST SIDE PREPARATORY HIGH SCHO 2006 193 105 54.4% 91 47.2% 86.7% ... 22 11.4% 21% 14 7.3% 13.3% 53 27.5% 35 18.100000000000001%

5 rows × 23 columns

Converting AP Test Scores

Many high school students take Advanced Placement (AP) exams before applying to college. There are several AP exams, each corresponding to a school subject, and students who earn high scores may receive college credit.

AP exams are scored on a 1-to-5 scale, with 3 or higher counting as a passing score. Students at academically challenging institutions take AP exams particularly often, while the exams are much rarer in schools that lack funding or academic rigor.

It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools.

In [28]:
data['ap_2010'].head()
Out[28]:
DBN SchoolName AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5
0 01M448 UNIVERSITY NEIGHBORHOOD H.S. 39 49 10
1 01M450 EAST SIDE COMMUNITY HS 19 21 s
2 01M515 LOWER EASTSIDE PREP 24 26 24
3 01M539 NEW EXPLORATIONS SCI,TECH,MATH 255 377 191
4 02M296 High School of Hospitality Management s s s
In [29]:
list(data['ap_2010'])
Out[29]:
['DBN',
 'SchoolName',
 'AP Test Takers ',
 'Total Exams Taken',
 'Number of Exams with scores 3 4 or 5']
In [30]:
data['ap_2010']['AP Test Takers'] = data['ap_2010']['AP Test Takers ']

cols = ['AP Test Takers', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data['ap_2010'][col] = pd.to_numeric(data['ap_2010'][col], errors="coerce")
    
data['ap_2010'].head()
Out[30]:
DBN SchoolName AP Test Takers Total Exams Taken Number of Exams with scores 3 4 or 5 AP Test Takers
0 01M448 UNIVERSITY NEIGHBORHOOD H.S. 39 49.0 10.0 39.0
1 01M450 EAST SIDE COMMUNITY HS 19 21.0 NaN 19.0
2 01M515 LOWER EASTSIDE PREP 24 26.0 24.0 24.0
3 01M539 NEW EXPLORATIONS SCI,TECH,MATH 255 377.0 191.0 255.0
4 02M296 High School of Hospitality Management s NaN NaN NaN
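Since a score of 3 or higher counts as passing, the two converted columns put a per-school pass rate one division away. A sketch on toy data mirroring the cleaned columns (the `pass_rate` column is hypothetical, not part of the original analysis):

```python
import pandas as pd

ap = pd.DataFrame({
    "Total Exams Taken": [49.0, 26.0],
    "Number of Exams with scores 3 4 or 5": [10.0, 24.0],
})

# Share of exams scored 3 or higher at each school
ap["pass_rate"] = ap["Number of Exams with scores 3 4 or 5"] / ap["Total Exams Taken"]
print(ap["pass_rate"].round(3).tolist())  # [0.204, 0.923]
```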

Performing the Left Joins

The ap_2010 and graduation data sets are missing entries for some schools, so I'll use left joins to keep every school that has SAT scores.

In [31]:
combined = data["sat_results"]
combined = combined.merge(data['ap_2010'], how = "left", on = 'DBN')
combined = combined.merge(data['graduation'], how = "left", on = 'DBN')
combined.head()
Out[31]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... Regents w/o Advanced - n Regents w/o Advanced - % of cohort Regents w/o Advanced - % of grads Local - n Local - % of cohort Local - % of grads Still Enrolled - n Still Enrolled - % of cohort Dropped Out - n Dropped Out - % of cohort
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355.0 404.0 363.0 1122.0 NaN NaN NaN ... 36 46.2% 83.7% 7 9% 16.3% 16 20.5% 11 14.1%
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383.0 423.0 366.0 1172.0 UNIVERSITY NEIGHBORHOOD H.S. 39 49.0 ... 34 27.4% 64.2% 11 8.9% 20.8% 46 37.1% 20 16.100000000000001%
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377.0 402.0 370.0 1149.0 EAST SIDE COMMUNITY HS 19 21.0 ... 67 74.400000000000006% 95.7% 3 3.3% 4.3% 15 16.7% 5 5.6%
3 01M458 FORSYTH SATELLITE ACADEMY 7 414.0 401.0 359.0 1174.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 01M509 MARTA VALLE HIGH SCHOOL 44 390.0 433.0 384.0 1207.0 NaN NaN NaN ... 23 27.4% 48.9% 7 8.300000000000001% 14.9% 25 29.8% 5 6%

5 rows × 34 columns

In [32]:
combined.shape
Out[32]:
(479, 34)
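The difference between the two join types is easy to see on toy frames: a left join keeps every SAT school and fills missing AP data with NaN, while an inner join drops unmatched schools entirely.

```python
import pandas as pd

sat = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122.0, 1172.0]})
ap = pd.DataFrame({"DBN": ["01M448"], "AP Test Takers": [39.0]})

left = sat.merge(ap, how="left", on="DBN")    # keeps 01M292, AP columns become NaN
inner = sat.merge(ap, how="inner", on="DBN")  # drops 01M292 entirely
print(len(left), len(inner))  # 2 1
```

This is also why the row count shrinks from 479 to 363 once the inner joins run.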

Performing the Inner Joins

The remaining data sets are more central to the analysis, so I'll use inner joins and drop any school that's missing from them.

In [33]:
to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
    
combined.head()
Out[33]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority10 Location 1 Community Board Council District Census Tract BIN BBL NTA lat long
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355.0 404.0 363.0 1122.0 NaN NaN NaN ... NaN 220 Henry Street\nNew York, NY 10002\n(40.7137... 3.0 1.0 201.0 1003223.0 1.002690e+09 Lower East Side ... 40.713764 -73.985260
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383.0 423.0 366.0 1172.0 UNIVERSITY NEIGHBORHOOD H.S. 39 49.0 ... NaN 200 Monroe Street\nNew York, NY 10002\n(40.712... 3.0 1.0 202.0 1003214.0 1.002590e+09 Lower East Side ... 40.712332 -73.984797
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377.0 402.0 370.0 1149.0 EAST SIDE COMMUNITY HS 19 21.0 ... NaN 420 East 12 Street\nNew York, NY 10009\n(40.72... 3.0 2.0 34.0 1005974.0 1.004390e+09 East Village ... 40.729783 -73.983041
3 01M509 MARTA VALLE HIGH SCHOOL 44 390.0 433.0 384.0 1207.0 NaN NaN NaN ... NaN 145 Stanton Street\nNew York, NY 10002\n(40.72... 3.0 1.0 3001.0 1004323.0 1.003540e+09 Chinatown ... 40.720569 -73.985673
4 01M539 NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... 159 522.0 574.0 525.0 1621.0 NEW EXPLORATIONS SCI,TECH,MATH 255 377.0 ... NaN 111 Columbia Street\nNew York, NY 10002\n(40.7... 3.0 2.0 2201.0 1004070.0 1.003350e+09 Lower East Side ... 40.718725 -73.979426

5 rows × 166 columns

In [34]:
combined.shape
Out[34]:
(363, 166)

Filling in Missing Values

In [35]:
# Fill numeric gaps with column means, then anything left over with 0
combined = combined.fillna(combined.mean(numeric_only=True))
combined = combined.fillna(0)

combined.head()
Out[35]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority10 Location 1 Community Board Council District Census Tract BIN BBL NTA lat long
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355.0 404.0 363.0 1122.0 0 0 197.038462 ... 0 220 Henry Street\nNew York, NY 10002\n(40.7137... 3.0 1.0 201.0 1003223.0 1.002690e+09 Lower East Side ... 40.713764 -73.985260
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383.0 423.0 366.0 1172.0 UNIVERSITY NEIGHBORHOOD H.S. 39 49.000000 ... 0 200 Monroe Street\nNew York, NY 10002\n(40.712... 3.0 1.0 202.0 1003214.0 1.002590e+09 Lower East Side ... 40.712332 -73.984797
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377.0 402.0 370.0 1149.0 EAST SIDE COMMUNITY HS 19 21.000000 ... 0 420 East 12 Street\nNew York, NY 10009\n(40.72... 3.0 2.0 34.0 1005974.0 1.004390e+09 East Village ... 40.729783 -73.983041
3 01M509 MARTA VALLE HIGH SCHOOL 44 390.0 433.0 384.0 1207.0 0 0 197.038462 ... 0 145 Stanton Street\nNew York, NY 10002\n(40.72... 3.0 1.0 3001.0 1004323.0 1.003540e+09 Chinatown ... 40.720569 -73.985673
4 01M539 NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... 159 522.0 574.0 525.0 1621.0 NEW EXPLORATIONS SCI,TECH,MATH 255 377.000000 ... 0 111 Columbia Street\nNew York, NY 10002\n(40.7... 3.0 2.0 2201.0 1004070.0 1.003350e+09 Lower East Side ... 40.718725 -73.979426

5 rows × 166 columns
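The two-step imputation above first fills each numeric column's missing values with that column's mean, then replaces whatever is still missing (non-numeric columns have no mean) with 0. A minimal sketch of the same pattern on a toy frame (the column names are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: one numeric column with a gap, one string column with a gap.
df = pd.DataFrame({
    "score": [1100.0, np.nan, 1300.0],
    "name": ["A", None, "C"],
})

# Step 1: numeric NaNs -> column mean (1200.0 here).
# numeric_only=True keeps pandas from trying to average the string column.
df = df.fillna(df.mean(numeric_only=True))

# Step 2: anything still missing (the string column) -> 0.
df = df.fillna(0)

print(df)
```

Note that the order matters: running `fillna(0)` first would zero out the numeric gaps before the means could be computed into them.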

Adding a School District Column for Mapping

In [36]:
def first_two(s):
    return s[0:2]

combined['school_dist'] = combined['DBN'].apply(first_two)
combined['school_dist'].head()
Out[36]:
0    01
1    01
2    01
3    01
4    01
Name: school_dist, dtype: object
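The slice works because the first two characters of a DBN encode the district. pandas' string accessor does the same thing as `apply(first_two)` without a helper function; a sketch on a few sample DBNs (the third code is illustrative):

```python
import pandas as pd

# Sample DBNs (district-borough-number codes).
dbns = pd.Series(["01M292", "01M448", "13K605"])

# .str[:2] vectorizes the slice, equivalent to apply(first_two).
school_dist = dbns.str[:2]
print(school_dist.tolist())  # ['01', '01', '13']
```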

Finding Correlations With the r Value

In [37]:
correlations = combined.corr()
correlations = correlations['sat_score']
correlations
Out[37]:
SAT Critical Reading Avg. Score         0.986820
SAT Math Avg. Score                     0.972643
SAT Writing Avg. Score                  0.987771
sat_score                               1.000000
Total Exams Taken                       0.514333
Number of Exams with scores 3 4 or 5    0.463245
AP Test Takers                          0.523140
Total Cohort                            0.325144
CSD                                     0.042948
NUMBER OF STUDENTS / SEATS FILLED       0.394626
NUMBER OF SECTIONS                      0.362673
AVERAGE CLASS SIZE                      0.381014
SIZE OF SMALLEST CLASS                  0.249949
SIZE OF LARGEST CLASS                   0.314434
SCHOOLWIDE PUPIL-TEACHER RATIO               NaN
schoolyear                                   NaN
fl_percent                                   NaN
frl_percent                            -0.722225
total_enrollment                        0.367857
ell_num                                -0.153778
ell_percent                            -0.398750
sped_num                                0.034933
sped_percent                           -0.448170
asian_num                               0.475445
asian_per                               0.570730
black_num                               0.027979
black_per                              -0.284139
hispanic_num                            0.025744
hispanic_per                           -0.396985
white_num                               0.449559
                                          ...   
aca_p_11                                0.035155
saf_t_11                                0.313810
com_t_11                                0.082419
eng_t_11                                0.036906
aca_t_11                                0.132348
saf_s_11                                0.337639
com_s_11                                0.187370
eng_s_11                                0.213822
aca_s_11                                0.339435
saf_tot_11                              0.318753
com_tot_11                              0.077310
eng_tot_11                              0.100102
aca_tot_11                              0.190966
grade_span_min                         -0.021315
grade_span_max                               NaN
expgrade_span_min                            NaN
expgrade_span_max                            NaN
postcode                               -0.063977
total_students                          0.407827
number_programs                         0.117012
priority08                                   NaN
priority09                                   NaN
priority10                                   NaN
Community Board                        -0.060919
Council District                       -0.076151
Census Tract                            0.048737
BIN                                     0.052232
BBL                                     0.044427
lat                                    -0.121029
long                                   -0.132222
Name: sat_score, Length: 74, dtype: float64

Observations

  • total_enrollment has a strong positive correlation with sat_score. This is surprising because we'd expect smaller schools where students receive more attention to have higher scores. However, it looks like the opposite is true -- larger schools tend to do better on the SAT.
    • Other columns that are proxies for enrollment correlate similarly. These include total_students, N_s, N_p, N_t, AP Test Takers, Total Exams Taken, and NUMBER OF SECTIONS.
  • Both the percentage of females (female_per) and number of females (female_num) at a school correlate positively with SAT score, whereas the percentage of males (male_per) and the number of males (male_num) correlate negatively. This could indicate that women do better on the SAT than men.
  • Teacher and student ratings of school safety (saf_t_11 and saf_s_11) correlate with sat_score.
  • Student ratings of school academic standards (aca_s_11) correlate with sat_score, but this does not hold for ratings from teachers and parents (aca_p_11 and aca_t_11).
  • There is significant racial inequality in SAT scores (white_per, asian_per, black_per, hispanic_per).
  • The percentage of English language learners at a school (ell_percent) and the percentage of students eligible for free or reduced-price lunch (frl_percent) both have strong negative correlations with SAT scores.
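Scanning 74 r values by eye is error-prone; sorting by absolute value surfaces the strongest relationships (positive or negative) first. A sketch of the pattern, using a few of the r values from the output above as a stand-in for the full Series:

```python
import pandas as pd

# Stand-in for combined.corr()['sat_score'] (subset of the real values).
correlations = pd.Series({
    "frl_percent": -0.722,
    "asian_per": 0.571,
    "ell_percent": -0.399,
    "CSD": 0.043,
})

# Sort by |r| so strong negative correlations rank alongside strong positives.
strongest = correlations.reindex(
    correlations.abs().sort_values(ascending=False).index
)
print(strongest)
```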

Plotting Total Enrollment versus SAT Score

In [38]:
import plotly as py 
import plotly.graph_objs as go
import numpy as np

py.offline.init_notebook_mode(connected=True)

trace = go.Scatter(
    x = combined['total_enrollment'],
    y = combined['sat_score'],
    mode = 'markers', 
    text = combined['SCHOOL NAME']
)

layout = go.Layout(
    title= "total_enrollment vs. sat_score", 
    hovermode= 'closest', 
    xaxis=dict(
    title='total_enrollment',
    ticklen=5,
    zeroline=False,
    gridwidth=2,
    ), 
    yaxis=dict(
    title='sat_score',
    ticklen=5,
    gridwidth=2,
    )  
)

data = [trace]

fig = go.Figure(data = data, layout = layout)

py.offline.iplot(fig)

Exploring Schools With Low SAT Scores and Enrollment

In [39]:
low_enrollment = combined[(combined['total_enrollment'] < 1000) & (combined['sat_score'] < 1000)]
low_enrollment['School Name']
Out[39]:
91       INTERNATIONAL COMMUNITY HIGH SCHOOL
125                                        0
126          BRONX INTERNATIONAL HIGH SCHOOL
139    KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
141    INTERNATIONAL SCHOOL FOR LIBERAL ARTS
176                                        0
179            HIGH SCHOOL OF WORLD CULTURES
188       BROOKLYN INTERNATIONAL HIGH SCHOOL
225    INTERNATIONAL HIGH SCHOOL AT PROSPECT
237               IT TAKES A VILLAGE ACADEMY
253                MULTICULTURAL HIGH SCHOOL
286    PAN AMERICAN INTERNATIONAL HIGH SCHOO
Name: School Name, dtype: object

Observations

Researching the above schools revealed that most of the high schools with low total enrollment and low SAT scores have high percentages of English language learners. This indicates that it's actually ell_percent that correlates strongly with sat_score, rather than total_enrollment.
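One way to test that suspicion is to recompute the enrollment correlation with the high-ELL cluster removed: if the r value drops sharply, the cluster was driving it. A minimal sketch of that check with toy numbers (the schools, values, and 30% threshold are all illustrative):

```python
import pandas as pd

# Toy schools: three small high-ELL schools, three large low-ELL ones.
df = pd.DataFrame({
    "total_enrollment": [200, 250, 300, 2000, 2500, 3000],
    "sat_score": [900, 950, 920, 1400, 1500, 1450],
    "ell_percent": [40, 45, 50, 5, 3, 4],
})

# Correlation over all schools, then again without the high-ELL cluster.
r_full = df["total_enrollment"].corr(df["sat_score"])
subset = df[df["ell_percent"] < 30]
r_subset = subset["total_enrollment"].corr(subset["sat_score"])

print(round(r_full, 2), round(r_subset, 2))
```

In this toy data the correlation weakens considerably once the cluster is dropped, which is the signature of a relationship driven by a subgroup rather than by enrollment itself.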

In [40]:
trace = go.Scatter(
    x = combined['ell_percent'],
    y = combined['sat_score'],
    mode = 'markers', 
    text = combined['SCHOOL NAME']
)

layout = go.Layout(
    title= "ell_percent vs. sat_score", 
    hovermode= 'closest', 
    xaxis=dict(
    title='ell_percent',
    ticklen=5,
    zeroline=False,
    gridwidth=2,
    ), 
    yaxis=dict(
    title='sat_score',
    ticklen=5,
    gridwidth=2,
    )  
)

data = [trace]

fig = go.Figure(data = data, layout = layout)

py.offline.iplot(fig)

Observations

The ell_percent vs. sat_score scatterplot is more linear than the total_enrollment one, which supports ell_percent being the stronger predictor. However, there's still a cluster of schools with very high ell_percent values and low sat_score values -- the same group of international high schools listed earlier.

Mapping the Schools

In [41]:
import folium
from folium.plugins import MarkerCluster

m = folium.Map(location= [combined['lat'].mean(), combined['long'].mean()], zoom_start = 10, tiles='Stamen Terrain')

marker_cluster = MarkerCluster().add_to(m)

for name, row in combined.iterrows():
    popup = folium.Popup("{0}: {1}".format(row["DBN"], row["SCHOOL NAME"]), parse_html=True)
    folium.Marker([row["lat"], row["long"]], popup= popup).add_to(marker_cluster)

m
Out[41]:
In [42]:
m2 = folium.Map(location= [combined['lat'].mean(), combined['long'].mean()], zoom_start = 10, tiles='Stamen Terrain')
m2.add_child(folium.plugins.HeatMap([[row['lat'], row['long']] for name, row in combined.iterrows()]))
m2.save('schools_heatmap.html')
m2
Out[42]:

Observations

It is evident from the marker cluster and heat map that school density is highest in Manhattan, and lower in Brooklyn, the Bronx, Queens, and Staten Island.

Mapping the School Districts

Although heat maps are good for showing gradients, they lack the structure needed to accurately plot differences in SAT scores across the city. Mapping school districts is a better way to visualize this information, since each district has its own administration. New York City has several dozen school districts, each covering a small geographic area, so I mapped the districts with a choropleth map.

In [43]:
districts = combined.groupby('school_dist').mean()
districts.reset_index(inplace = True)
districts.head()
Out[43]:
school_dist SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score Total Exams Taken Number of Exams with scores 3 4 or 5 AP Test Takers Total Cohort CSD ... priority08 priority09 priority10 Community Board Council District Census Tract BIN BBL lat long
0 01 441.833333 473.333333 439.333333 1354.500000 173.019231 135.800000 116.681090 93.500000 1.0 ... 0.0 0.0 0.0 3.000000 1.500000 1106.833333 1.004144e+06 1.003302e+09 40.719022 -73.982377
1 02 426.619092 444.186256 424.832836 1295.638184 201.516827 157.495833 128.908454 158.647849 2.0 ... 0.0 0.0 0.0 4.204005 2.963296 1132.479744 1.103862e+06 1.070438e+09 40.739699 -73.991386
2 03 428.529851 437.997512 426.915672 1293.443035 244.522436 193.087500 156.183494 183.384409 3.0 ... 0.0 0.0 0.0 7.500000 6.500000 166.666667 1.034931e+06 1.012833e+09 40.781574 -73.977370
3 04 402.142857 416.285714 405.714286 1224.142857 183.879121 151.035714 129.016484 113.857143 4.0 ... 0.0 0.0 0.0 11.000000 8.000000 2637.000000 1.055874e+06 1.016681e+09 40.793449 -73.943215
4 05 427.159915 438.236674 419.666098 1285.062687 115.725275 142.464286 85.722527 143.677419 5.0 ... 0.0 0.0 0.0 9.428571 8.142857 9372.571429 1.063080e+06 1.019721e+09 40.817077 -73.949251

5 rows × 75 columns

In [44]:
# remove leading 0s
districts["school_dist"] = districts["school_dist"].apply(lambda x: str(int(x)))

import os

def district_map(col):
    dist_geo = os.path.join('schools','NYC School District Boundaries.geojson')
    dist = folium.Map(location=[districts['lat'].mean(), districts['long'].mean()], zoom_start=10, tiles='cartodbpositron')
    folium.Choropleth(
        geo_data=dist_geo,
        data=districts,
        columns=['school_dist', col],
        key_on='feature.properties.schooldist',
        fill_color='YlOrRd',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=col
    ).add_to(dist)
    folium.LayerControl().add_to(dist)
    return dist

district_map("sat_score")
Out[44]:

Exploring English Language Learners and SAT Scores

In the scatter plot of English language learner percentage (ell_percent) vs. SAT score (sat_score), I observed a group of schools with a high ell_percent that also have low average SAT scores. Extending this to the district level should reveal the same patterns in SAT scores across the city.

In [45]:
district_map('ell_percent')
Out[45]:

Observations

Districts with a low proportion of English language learners tend to have high SAT scores, and vice versa.

Correlating Survey Scores and SAT Scores

In [46]:
survey_fields = ["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", 
   "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", 
   "com_tot_11", "eng_tot_11", "aca_tot_11"]

# DBN is a string identifier, so it has no entry in the correlation matrix
trace = go.Bar(
    y=combined.corr()['sat_score'][survey_fields],
    name='Survey Fields',
    text=survey_fields
    )

data = [trace]
layout = go.Layout(
    title='Survey Correlations',
    xaxis=dict(
    title='Survey Fields'
    ), 
    yaxis=dict(
    title='r-value'
    )  
)
    
fig = go.Figure(data=data, layout=layout)

py.offline.iplot(fig)

Observations

There are high correlations between N_s, N_t, N_p and sat_score. Since these columns are proxies for total_enrollment, it makes sense that their correlations are high.

It is more interesting that rr_s, the student response rate (the percentage of students who completed the survey), correlates with sat_score. Students who take the time to fill out surveys may also be more likely to be doing well academically.

How students and teachers perceived safety (saf_s_11 and saf_t_11) correlates with sat_score. This makes sense, as it's hard to teach or learn in an unsafe environment.

The last interesting correlation is aca_s_11, how students perceive academic standards, which correlates with sat_score; the same is not true of aca_t_11 (how teachers perceive academic standards) or aca_p_11 (how parents perceive them).

Exploring Safety and SAT Scores

In [47]:
trace = go.Scatter(
    x = combined['saf_s_11'],
    y = combined['sat_score'],
    mode = 'markers', 
    text = combined['SCHOOL NAME']
)

layout = go.Layout(
    title= "saf_s_11 vs. sat_score", 
    hovermode= 'closest', 
    xaxis=dict(
    title='saf_s_11'
    ), 
    yaxis=dict(
    title='sat_score',
    )  
)

data = [trace]

fig = go.Figure(data = data, layout = layout)

py.offline.iplot(fig)

Observations

There appears to be a correlation between SAT scores and safety, although it isn't that strong. It looks like there are a few schools with extremely high SAT scores and high safety scores. There are a few schools with low safety scores and low SAT scores. No school with a safety score lower than 6.5 has an average SAT score higher than 1500 or so.
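The "no school below a safety score of 6.5 tops roughly 1500" observation can be confirmed with a boolean mask rather than by eyeballing the plot. The pattern, sketched on toy data (the 6.5 threshold comes from the text; the schools and values are illustrative):

```python
import pandas as pd

# Toy schools (safety scores and SATs are illustrative).
df = pd.DataFrame({
    "saf_s_11": [5.8, 6.2, 6.9, 7.5, 8.1],
    "sat_score": [1120, 1340, 1610, 1880, 1450],
})

# Highest SAT among schools rated below the 6.5 safety threshold.
low_safety_max = df[df["saf_s_11"] < 6.5]["sat_score"].max()
print(low_safety_max)  # 1340
```

Running the same one-liner against `combined` would give the exact ceiling instead of the "1500 or so" estimate.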

In [48]:
district_map('saf_s_11')
Out[48]:

Observations

It looks like Upper Manhattan and parts of Queens and the Bronx tend to have lower safety scores, whereas Brooklyn has high safety scores.

Exploring Race and SAT Scores

In [49]:
trace = go.Bar(
    y=combined.corr()['sat_score'][["white_per", "asian_per", "black_per", "hispanic_per"]],
    name='Race Correlations',
    text = ["white_per", "asian_per", "black_per", "hispanic_per"]
    )

data = [trace]
layout = go.Layout(
    title='Race Correlations',
    xaxis=dict(
    title='Race Fields'
    ), 
    yaxis=dict(
    title='r-value'
    )  
)
    
fig = go.Figure(data=data, layout=layout)

py.offline.iplot(fig)

Observations

It looks like a higher percentage of white or Asian students at a school correlates positively with SAT score, whereas a higher percentage of Black or Hispanic students correlates negatively. This may be due to a lack of funding for schools in certain areas, which are more likely to have higher percentages of Black or Hispanic students.

In [50]:
trace = go.Scatter(
    x = combined['hispanic_per'],
    y = combined['sat_score'],
    mode = 'markers', 
    text = combined['SCHOOL NAME']
)

layout = go.Layout(
    title= "hispanic_per vs. sat_score", 
    hovermode= 'closest', 
    xaxis=dict(
    title='hispanic_per'
    ), 
    yaxis=dict(
    title='sat_score',
    )  
)

data = [trace]

fig = go.Figure(data = data, layout = layout)

py.offline.iplot(fig)
In [51]:
combined[combined["hispanic_per"] > 95]["SCHOOL NAME"]
Out[51]:
44                         MANHATTAN BRIDGES HIGH SCHOOL
82      WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL
89     GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...
125                  ACADEMY FOR LANGUAGE AND TECHNOLOGY
141                INTERNATIONAL SCHOOL FOR LIBERAL ARTS
176     PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
253                            MULTICULTURAL HIGH SCHOOL
286               PAN AMERICAN INTERNATIONAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

The schools listed above appear to be geared primarily toward recent immigrants to the US. These schools have many students who are still learning English, which would explain the lower SAT scores.

In [52]:
combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)]["SCHOOL NAME"]
Out[52]:
37                                STUYVESANT HIGH SCHOOL
151                         BRONX HIGH SCHOOL OF SCIENCE
187                       BROOKLYN TECHNICAL HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                  STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

Many of the schools above are specialized science and technology schools that receive extra funding and only admit students who pass an entrance exam. The entrance exam doesn't explain the low hispanic_per, but it does explain why their students tend to do well on the SAT -- they were admitted from all over New York City on the strength of a standardized test.

Exploring Gender and SAT Scores

In [53]:
trace = go.Bar(
    y=combined.corr()['sat_score'][["male_per", "female_per"]],
    name='Gender Correlations',
    text = ["male_per", "female_per"]
    )

data = [trace]
layout = go.Layout(
    title='Gender Correlations',
    xaxis=dict(
    title='Gender Fields'
    ), 
    yaxis=dict(
    title='r-value'
    )  
)
    
fig = go.Figure(data=data, layout=layout)

py.offline.iplot(fig)

Observations

In the plot above, we can see that a high percentage of females at a school positively correlates with SAT score, whereas a high percentage of males at a school negatively correlates with SAT score. Neither correlation is extremely strong.

In [54]:
trace = go.Scatter(
    x = combined['female_per'],
    y = combined['sat_score'],
    mode = 'markers', 
    text = combined['SCHOOL NAME']
)

layout = go.Layout(
    title= "female_per vs. sat_score", 
    hovermode= 'closest', 
    xaxis=dict(
    title='female_per'
    ), 
    yaxis=dict(
    title='sat_score',
    )  
)

data = [trace]

fig = go.Figure(data = data, layout = layout)

py.offline.iplot(fig)

Observations

Based on the scatterplot, there doesn't seem to be a strong overall correlation between sat_score and female_per. However, there is a cluster of schools with a high percentage of females (60 to 80 percent) and high SAT scores.

In [55]:
combined[(combined["female_per"] > 60) & (combined["sat_score"] > 1700)]["SCHOOL NAME"]
Out[55]:
5                         BARD HIGH SCHOOL EARLY COLLEGE
26                         ELEANOR ROOSEVELT HIGH SCHOOL
60                                    BEACON HIGH SCHOOL
61     FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...
302                          TOWNSEND HARRIS HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

These schools appear to be highly selective liberal arts schools with high academic standards.